CProMG: controllable protein-oriented molecule generation with desired binding affinity and drug-like properties

Abstract Motivation Deep learning-based molecule generation has become a new paradigm of de novo molecule design since it enables fast and directional exploration of the vast chemical space. However, it is still an open issue to generate molecules that bind to specific proteins with high binding affinities while possessing desired drug-like physicochemical properties. Results To address these issues, we propose a novel framework for controllable protein-oriented molecule generation, named CProMG, which contains a 3D protein embedding module, a dual-view protein encoder, a molecule embedding module, and a novel drug-like molecule decoder. By fusing the hierarchical views of proteins, it significantly enhances the representation of protein binding pockets by associating amino acid residues with their constituent atoms. By jointly embedding molecule sequences, their drug-like properties, and binding affinities w.r.t. proteins, it autoregressively generates novel molecules with specific properties in a controllable manner by measuring the proximity of molecule tokens to protein residues and atoms. A comparison with state-of-the-art deep generative methods demonstrates the superiority of CProMG. Furthermore, the progressive control of properties demonstrates the effectiveness of CProMG in controlling binding affinity and drug-like properties. The ablation studies then reveal how its crucial components, including hierarchical protein views, Laplacian position encoding, and property control, each contribute to the model. Last, a case study w.r.t. a protein illustrates the novelty of CProMG and its ability to capture crucial interactions between protein pockets and molecules. We anticipate that this work can boost de novo molecule design. Availability and implementation The code and data underlying this article are freely available at https://github.com/lijianing0902/CProMG.


Introduction
During drug design, it is essential to screen or design candidate compounds that bind to protein targets. However, it is extremely difficult to find appropriate small molecules in the vast chemical space, which is estimated to contain $10^{23}$ to $10^{60}$ compounds (Polishchuk et al. 2013). High-throughput screening (Macarron et al. 2011) and virtual screening (Schneider and Böhm 2002) are two classical techniques of computer-aided drug design, which search for candidate molecules in predefined compound libraries. However, because these libraries are predefined and small, such techniques explore only a limited part of chemical space, and the molecules they find are not novel. In recent years, biologists and pharmacologists have been paying attention to various deep generative models, which have been successfully used in computer vision and natural language processing. They believe that the design of novel small molecules via deep generative models (called molecule generation) can explore the entire chemical space. Molecule generation provides a new paradigm of de novo molecule design.
Current deep learning-based molecule generation methods can be roughly categorized into ligand-based and protein-based methods.
(1) Ligand-based methods
Since recurrent neural networks (RNNs) naturally process variable-length sequences, they can generate novel molecules by representing molecules as SMILES strings. Furthermore, to refine generated molecules toward desired drug-like properties, RNN-based methods typically employ diverse optimization strategies. For example, ChemTS directly applies an RNN to generate novel molecules, which are further optimized by a tree search to find molecules having specific drug-like properties (Yang et al. 2017). Based on a pretraining strategy, an RNN can be fine-tuned by transfer learning (Segler et al. 2018) or reinforcement learning (Wang et al. 2021) to generate property-specific molecules. In contrast to these optimization strategies, a conditional RNN whose initial state is set with specific molecular properties can directly generate novel property-specific molecules (Kotsias et al. 2020). However, RNNs are designed for sequences, not for graphs (e.g. molecule structures).
GANs and VAEs are two typical distribution-based generative models, which can characterize the small molecule space. A GAN contains a generator and a discriminator, which compete with each other in a zero-sum (adversarial) game. They are trained together in an adversarial manner, which enables the generation of novel molecules. ORGAN (Guimaraes et al. 2018) utilizes a SMILES-based GAN to generate molecules. The pioneering graph-based model, MolGAN (De Cao and Kipf 2018), employs a GAN to directly generate molecular graphs. Adding a reinforcement learning module is a popular strategy to help generate molecules with specific properties. However, the training of GANs usually suffers from mode collapse.
VAEs are generative encoder-decoder models under explicit normal distribution assumptions. Since the latent distribution space is analogous to the chemical space, designated sampling in it enables the generation of novel molecules with specific properties. There are various approaches to generating novel molecules with desired properties, such as an extra property predictor (Gómez-Bombarelli et al. 2018) and a conditional VAE (Lim et al. 2018). Considering that molecule structures contain richer information than SMILES strings, some works directly generate novel molecular structures rather than SMILES strings. For example, GraphVAE (Simonovsky and Komodakis 2018) designs a graph-based VAE by representing molecules as graphs with attributes. JT-VAE (Jin et al. 2018) combines a tree-structured scaffold over chemical substructures into a molecule with a graph message-passing network. Notably, VAE-generated molecules only exhibit moderate novelty and diversity.
Ligand-based methods can generate novel compounds with favorable physicochemical properties. However, since they consider little or no protein information when generating molecules, they cannot guarantee that generated molecules have the desired binding affinity to new protein targets.
(2) Protein-based methods
In contrast, recent protein-based methods ensure that generated molecules bind to specific protein targets with high binding affinities. Some methods turn such a generation into a machine translation problem, which translates protein sequences (amino acid sequences) into molecule sequences (SMILES strings). In this context, Transformer can be directly applied for protein-based molecule generation (Grechishnikova 2021). AlphaDrug (Qian et al. 2022) improves the vanilla Transformer with multiple skip connections from its protein encoder to its molecule decoder to obtain better protein representations, and further applies a tree search to guide molecule generation. However, by only considering protein sequences, these methods neglect the information in binding pockets, which implies how a molecule binds to a protein.
Some methods attempt to utilize protein 3D structures when generating molecules. For example, based on the voxelization of 3D protein pockets and 3D molecule structures, Skalic et al. (2019) train a GAN to generate 3D shapes of molecules, which are further decoded into multiple candidate SMILES strings by a captioning network. Recently, Xu et al. (2021) constructed a protein residue-based Coulomb matrix to directly characterize the spatial structure of the pocket, which is further input into a conditional RNN to control the generation of molecules.
To enhance the protein structure representation, recent works characterize the binding interface between a protein and a molecule. By considering the 3D coordinates of atoms in given binding sites, Luo et al. (2021) design a 3D generative model to estimate the probability density of atom occurrences in the 3D binding space, and perform an autoregressive sampling scheme on the binding spatial locations assigned higher probabilities to generate molecules atom by atom. However, this approach ignores bond types and functional groups in the binding pocket. To solve this problem, its extension, Pocket2Mol (Peng et al. 2022), designed an E(3) equivariant neural network to capture spatial and bonding relationships between atoms in the binding pocket.
However, the representation of 3D structures remains challenging. In terms of binding affinity, the molecules generated by these 3D structure-based methods are, surprisingly, worse than those generated from 1D sequences (Qian et al. 2022). More importantly, it is difficult to generate small molecules whose drug-like physicochemical properties are under control.
To address the above issues, we propose a protein-oriented generative framework (CProMG), which contains a 3D protein embedding module, a dual-view protein encoder, a molecule embedding module, and a novel drug-like molecule decoder. Overall, the main contributions of our CProMG are as follows.
1) It serves as a controllable learning framework to generate novel small molecules that have high binding affinities to specific protein targets while possessing desired drug-like properties.
2) It provides a better representation of 3D protein structure (pocket) by integrating a fine-grained atom view with a coarse-grained amino acid view based on an interactive attention block in the encoder.
3) It leverages the protein-interactive multi-head attention block in the decoder to calculate the proximity of molecule tokens to protein residues and atoms, such that crucial interactions between protein pockets and molecules can be captured.

Problem formulation and model construction
Suppose that $m$ proteins $P = \{p_i;\ i = 1, 2, \ldots, m\}$ bind to $n$ small molecules $C = \{c_j;\ j = 1, 2, \ldots, n\}$. Let $a_{i,j}$ be the binding affinity of $p_i$ with respect to $c_j$. In addition, $c_j$ has specific physicochemical properties $y_j \in \mathbb{R}^{1 \times d_y}$, where $y_j(t) \in \{1, 0\}$ or $y_j(t) \in \mathbb{R}$, $j \in \{1, 2, \ldots, n\}$. The former type of $y_j(t)$ indicates a hard (binary/discrete) property of $c_j$ (e.g. Synthetic Accessibility, SA), while the latter represents a soft (continuous) property (e.g. logP). For example, the molecule PF-4989216, assigned the compound ID 51033720 in PubChem, has a value of logP = 2.919. Meanwhile, it has a good SA score (i.e. 0.78). In practice, since pharmacologists are more interested in whether the molecule can be synthesized easily, SA is binarized by the rule that SA = 1 if SA $\leq$ 4.0 (i.e. easy to synthesize), and SA = 0 otherwise (i.e. difficult to synthesize) (Wang et al. 2021). We consider both types of molecule properties simultaneously when generating novel molecules.
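As a minimal sketch of how a property vector mixing a hard and a soft property could be assembled, the following snippet applies the SA binarization rule described above. The helper names `binarize_sa` and `property_vector` and the example values are illustrative assumptions, not part of the CProMG code.

```python
def binarize_sa(sa_score, threshold=4.0):
    """Hard property: 1 if the molecule is easy to synthesize (SA <= threshold), else 0."""
    return 1 if sa_score <= threshold else 0


def property_vector(sa_score, logp):
    """Build y = [hard SA flag, soft logP value] for one molecule (hypothetical layout)."""
    return [float(binarize_sa(sa_score)), float(logp)]


# Hypothetical molecule with SA = 3.2 and logP = 2.919.
y = property_vector(sa_score=3.2, logp=2.919)
```

The hard property is stored as 0/1 while the soft property keeps its continuous value, so both types can sit in the same vector.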
Given a new protein $p_x$, the task is to generate a set of novel molecules $\{c_x^k;\ k = 1, 2, \ldots\}$ that bind to $p_x$ with high binding affinities and have desired physicochemical properties $y_x$. Inspired by Transformer (Vaswani et al. 2017), we treat this task as a specific translation from proteins into small molecules. We design a novel protein-oriented molecule generation framework, including a 3D protein graph embedding module, a dual-view protein encoder, a drug-like molecule embedding module, and a novel molecule decoder (Fig. 1).

Protein graph embedding module
Inspired by the hierarchy of protein structure (Jin et al. 2022; Wang et al. 2023), we characterize 3D protein structures in both an amino acid view and an atom view. To reduce the computation, only binding pockets are considered when characterizing 3D protein structures.
(a) Protein graph construction
Technically, given a protein $p$, its 3D structure can be represented as a graph $G = (V, E)$. Specifically, $V = \{(v_i, r_i)\}_{i=1}^{n}$ is the node set, where a node $v_i$ has known 3D coordinates $r_i \in \mathbb{R}^3$ and $n$ denotes the number of nodes. Moreover, $E = \{e_{ij};\ i, j = 1, 2, \ldots, n \text{ and } i \neq j\}$ denotes a set of edges between nodes. We build an amino acid residue-based graph ($G_r$) and an atom-based graph ($G_a$), respectively.
In the residue-based graph $G_r$, we treat amino acid residues as the nodes $V$. Each node $v_i$ is naturally represented as a one-hot coding vector $x_i$ according to the 20 amino acid types. The edges between them are determined by their Euclidean distances. In detail, as the representative point of $v_i$, its centroid is first calculated by

$$c_i = \frac{\sum_k m_i^k r_i^k}{\sum_k m_i^k},$$

where $\{r_i^k\}$ are the atom coordinates in $v_i$ and $\{m_i^k\}$ are the corresponding atomic masses. Then, the pairwise Euclidean distance between $v_i$ and $v_j$ is calculated by $d_{i,j} = \|c_i - c_j\|_2$. It is further used to construct edges $\{e_{ij};\ j \in N_i\}$ by the K-nearest neighbor (KNN) algorithm, which selects the $k$ nearest neighbor nodes $\{v_j;\ j \in N_i\}$ (e.g. $k = 48$), where $N_i$ is the node neighborhood of $v_i$. Last, $d_{i,j}$ is set as the initial representation of $e_{ij}$. The residue view of a protein provides a coarse-grained representation of its binding pocket.
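The residue graph construction can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the centroid is taken to be mass-weighted (the text mentions atomic masses), and the function names `mass_weighted_centroid` and `knn_edges` are hypothetical.

```python
import numpy as np


def mass_weighted_centroid(coords, masses):
    """Centroid of one residue from its atom coordinates (n_atoms, 3) and atomic masses."""
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    return (masses[:, None] * coords).sum(axis=0) / masses.sum()


def knn_edges(centroids, k):
    """Map each directed edge (i, j) to d_ij, where j is among the k nearest neighbors of i."""
    c = np.asarray(centroids, dtype=float)
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a node is never its own neighbor
    k = min(k, len(c) - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    return {(i, int(j)): float(dist[i, j]) for i in range(len(c)) for j in neighbors[i]}


# Toy pocket with three residue centroids on a line (hypothetical coordinates).
edges = knn_edges([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]], k=1)
```

The distance stored on each edge is the scalar that later serves as the initial edge representation.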
In the atom-based graph $G_a$, we treat atoms as the nodes $V$. Similarly, each atom is represented by a one-hot encoding based on six popular atom types, including H, C, N, O, S, and P. Moreover, each atom is annotated by an additional bit, where 1 indicates that it is located in the protein backbone, and 0 otherwise. Thus, each node $v_i$ is represented as a 7-dimensional vector ($x_i$). We determine an edge between two nodes in a similar way as in $G_r$ but with a different number of nearest neighbors (i.e. $k = 30$) as recommended by Ingraham et al. (2019). The atom view provides a fine-grained representation of the binding pocket.
Once the protein graphs are built, our task is to generate the embeddings of nodes and edges. In general, suppose that each node $v_i$ has the initial representation $x_i \in \mathbb{R}^{1 \times d_v}$ and each edge $e_{ij}$ has the initial representation $d_{i,j}$. The initial node embedding is obtained by

$$h_i^{(0)} = x_i W^{(0)} + h_i^{pos},$$

where $h_i^{pos} \in \mathbb{R}^{1 \times d}$ is the Laplacian positional encoding vector of $v_i$, and $W^{(0)} \in \mathbb{R}^{d_v \times d}$ is the learnable parameter. See Section 2.2(b) for the definition of $h_i^{pos}$. Inspired by RBF neural networks (Seshagiri and Khalil 2000), we obtain the embedding of $e_{ij}$ by using $d_e$ RBFs to map its initial representation $d_{i,j}$ into $e_{ij} \in \mathbb{R}^{1 \times d_e}$.

Figure 1. The framework of CProMG. This framework is composed of four modules: a 3D protein graph embedding module, a dual-view protein encoder, a drug-like molecule embedding module, and a novel molecule decoder. (a) Protein embedding module. A protein (pocket) is represented as a residue graph and an atom graph in parallel. Nodes and edges in each protein graph are embedded. In particular, nodes have additional Laplacian positional encodings. Node representations are also augmented by edge representations. (b) Dual-view protein encoder. It contains two parallel encoder modules, one per protein graph, each of which is composed of t encoding blocks. Each block contains a multi-head self-attention unit and a feed-forward neural network. There are also two cross-attention units between the parallel encoder modules. The concatenation of the representations of the two encoder modules is output as the protein representation and input into the molecule decoder as the key and value. (c) Molecule embedding module. It encodes physicochemical properties of small molecules, docking scores w.r.t. proteins, and their SMILES sequences simultaneously. Their concatenation is added to an extra positional encoding and used as the Query input of the decoder. (d) Molecule decoder. It contains t decoder blocks, each of which contains a masked multi-head attention unit, a cross-attention unit, and a feed-forward network. The decoder autoregressively predicts the next token of the molecular sequence from the generated molecular intermediates and the protein representation.
To encode a protein graph, each node needs a unique positional representation. However, it is hard to directly define the positions of nodes in a graph. In other words, we cannot apply the positional coding of the vanilla Transformer to the protein graphs. To cope with this issue, we borrow the idea of Laplacian position encoding from graph neural networks to obtain unique positional representations of nodes, as described in the following.
(b) Laplacian positional encoding
Because any signal can be represented as a combination of sine/cosine functions with varying frequencies, Transformer regards the positional coding as a Fourier Transform on a signal. As a result, each entity in turn is coded into a position-unique vector. However, such positional coding cannot be directly applied to graphs because it is hard to define the positions of nodes in a graph. To cope with this issue, we leverage the Laplacian position encoding in graph neural networks (Kreuzer et al. 2021) to assign each node in a graph a unique representation.
Given a weighted graph $G = (V, E, W)$, the weight $w_{ij}$ of each edge $e_{ij}$ is defined as $w_{ij} = e^{-d_{ij}^2 / (2\sigma^2)}$, where the hyperparameter $\sigma$ is empirically set as 30 in $G_r$ and 15 in $G_a$, respectively. Its Laplacian matrix $L \in \mathbb{R}^{n \times n}$ can be defined as follows (Dwivedi et al. 2022):

$$L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^{T},$$

where $I \in \mathbb{R}^{n \times n}$ is the identity matrix, the $n \times n$ diagonal matrix $D$ represents the degree matrix of the weighted graph $G$, its $k$-th diagonal element $d_{k,k}$ is the degree of the $k$-th node, $A$ represents the weighted adjacency matrix of $G$, the $n \times n$ diagonal eigenvalue matrix $\Lambda$ contains the eigenvalues $\{\lambda_k\}$ in ascending order along its diagonal, and $U \in \mathbb{R}^{n \times n}$ contains the set of eigenvectors $\{u_k \in \mathbb{R}^{n \times 1}\}$ w.r.t. $\{\lambda_k\}$, where each $u_k$ is normalized to unit length (i.e. $u_k^T u_k = 1$). Thus, $L$ enables the Fourier Transform on the graph $G$. Specifically, the eigenvectors $\{u_k\}$ of $L$, analogous to sine/cosine functions (Dwivedi and Bresson 2021), can be regarded as the basis vectors to encode the positions of nodes in $G$. The eigenvalue is considered as a node position in the Fourier domain of the graph (Bronstein et al. 2017).
In spectral graph theory, eigenvalues can be used to discriminate between different graph structures and substructures, as they can be interpreted as the frequencies of resonance of the graph (again analogous to the frequencies reflected by sine/cosine functions) (Kreuzer et al. 2021). Accordingly, smaller eigenvalues (frequencies) are more heavily weighted when determining distances between nodes. Moreover, the corresponding low-frequency eigenvectors are spread across the graph, while higher frequencies often resonate in local structures (Kreuzer et al. 2021). Therefore, we take the low-frequency eigenvectors of nodes w.r.t. the first $k$ smallest eigenvalues [e.g. $k = 8$ (Kreuzer et al. 2021)] as their positional features.
For each node $v_i$, its positional encoding $h_i^{pos}$ is defined as:

$$h_i^{pos} = U(i, 1{:}k)\, W^{pos},$$

where $U(i, 1{:}k)$ is the positional vector consisting of the first $k$ elements in the $i$-th row of $U$, and the learnable $W^{pos} \in \mathbb{R}^{k \times d}$ works like an adapter to map the positional coding from the eigenspace to the node embedding space. Such a coding captures the intuition that nodes far apart are different whereas nodes nearby are similar in terms of positional features.
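The Laplacian positional encoding can be illustrated with a short NumPy sketch. It assumes the symmetric normalized Laplacian suggested by the symbols $I$, $D$, and $A$ above, and, for brevity, a dense graph over all node pairs rather than the KNN graph. The function name `laplacian_positional_features`, the toy coordinates, and the random adapter matrix are illustrative assumptions.

```python
import numpy as np


def laplacian_positional_features(dist, sigma, k):
    """Return the k smallest eigenvalues and their eigenvectors (one row per node)."""
    dist = np.asarray(dist, dtype=float)
    # Gaussian edge weights w_ij = exp(-d_ij^2 / (2 sigma^2)); no self-loops.
    A = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvals[:k], eigvecs[:, :k]


# Toy example: four nodes on a line (hypothetical coordinates, in angstroms).
x = np.array([0.0, 1.0, 2.0, 4.0])
dist = np.abs(x[:, None] - x[None, :])
eigvals, U_k = laplacian_positional_features(dist, sigma=15.0, k=2)

# The adapter W_pos is learnable in the model; a random matrix stands in for it here.
W_pos = np.random.default_rng(0).normal(size=(2, 8))
h_pos = U_k @ W_pos   # shape (4, 8): one positional encoding per node
```

Row $i$ of `U_k` plays the role of $U(i, 1{:}k)$, so each node receives its own positional vector before the linear adapter is applied.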

Dual-view encoder
The embeddings of the amino acid graph $G_r$ and those of the atom graph $G_a$ are input separately into the dual-view encoder to obtain the final representation of the protein binding pocket. The dual-view encoder contains two parallel encoders $En_r$ and $En_a$, which account for the encodings of the two graphs $G_r$ and $G_a$, respectively. Each encoder is composed of $t$ tandem encoding units, each of which contains an edge-augmented encoding block $E_a^t$ and a multi-head attention block $M_a^t$. The first block $E_a^t$ enhances node representations, while the attention block $M_a^t$ further updates node representations by a self-attention mechanism.
Remarkably, G r and G a represent the coarse-grained (residues) and the fine-grained (atoms) information of the protein binding pocket respectively. As a result, the dual-view encoder also leverages two cross-attention blocks between En r and En a to fuse the coarse-grained representation of the protein binding pocket with its fine-grained representation. Such a fusion helps capture the natural protein structure hierarchy.
(1) Graph encoders
Technically, consider a node $v_i$ and its neighboring nodes $\{v_j\}$, where $j \in N_i$ and $N_i$ is the neighborhood of $v_i$. For the $l$-th encoding unit, the edge-augmented encoding block $E_a^l$ enhances their representations to accommodate the self-attention framework in two steps. The first step maps the node representation $h$ [...] is output and used as the input of the next encoding unit if $l < t$. Last, supposing that the node representation matrices derived from the two encoders $En_r$ and $En_a$ are $H^{(t)}$ and $Z^{(t)}$, respectively, we vertically stack them as the final representation matrix $H_P$ of the protein structure (i.e. $H_P = [H^{(t)}; Z^{(t)}]$).
(2) Dual-view fusion
During the parallel encoding process, we designed a one-way cross-fusion block between the $l$-th encoding unit in $En_r$ and that in $En_a$, which updates the coarse-grained $H^{(l)}$ by the fine-grained [...] Similarly, the concatenation is mapped further by another linear layer as $\hat{h}$ [...], where $g(\cdot)$ is the normalization function. To reduce information redundancy, we only build two cross-fusion blocks, for a middle encoding unit (e.g. the 3rd unit) and the last encoding unit (e.g. the 6th unit), respectively.

Molecule embedding module
The molecule embedding module encodes molecule sequences based on a pre-built vocabulary and encodes their drug-like properties as well as binding affinities w.r.t. proteins simultaneously. The molecule decoder can generate novel molecules owning desired properties in a controllable manner.
We utilize the tokenization proposed by Schwaller et al. (2018) to build a vocabulary $V$, which contains $k$ non-overlapping "words" (substrings of the SMILES string, also called tokens), such that each SMILES string is turned into a sequence of words.
Formally, the SMILES sequence of a small molecule $c$ can be turned into an $n$-word sequence $s_c = \{a_1, \ldots, a_n\}$, where $a_i \in V$. To perform the decoding, we add a prefix tag to this word sequence as $s_c^* = \{b, a_1, \ldots, a_n\}$, where $b \in V$ is the beginning tag that starts the decoding. Accordingly, $s_c^*$ is represented as an $(n + 1) \times k$ one-hot encoding matrix $S$ based on the vocabulary $V$.
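The tokenization and one-hot encoding steps can be sketched as follows. The regular expression is a pattern commonly used for atom-level SMILES tokenization and should be treated as an assumption about the exact rule set; the tag string `<b>`, the toy vocabulary, and the function names are likewise hypothetical.

```python
import re

# Atom-level SMILES pattern (assumed; bracket atoms, two-letter halogens, bonds, ring closures).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split a SMILES string into non-overlapping tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("SMILES contains characters outside the token pattern")
    return tokens


def one_hot_with_begin_tag(tokens, vocab, begin="<b>"):
    """Encode [begin] + tokens as an (n + 1) x k one-hot matrix (list of rows)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for tok in [begin] + tokens:
        row = [0] * len(vocab)
        row[index[tok]] = 1
        rows.append(row)
    return rows


tokens = tokenize("CC(=O)Cl")                     # acetyl chloride
vocab = ["<b>", "(", ")", "=", "C", "Cl", "O"]    # toy vocabulary, k = 7
S = one_hot_with_begin_tag(tokens, vocab)
```

Note that "Cl" is kept as a single token, which is why a character-level split would not be equivalent.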
Let $\{p_1, \ldots, p_m\}$ be the property sequence of the molecule $c$, where each element indicates one of its property names (e.g. Synthetic Accessibility, logP, ...), and $y \in \mathbb{R}^{1 \times m}$ be the vector of property values, where $y(t) \in \{1, 0\}$ or $y(t) \in \mathbb{R}$. The former type of $y(t)$ indicates hard properties (e.g. Synthetic Accessibility), while the latter represents soft properties (e.g. logP). To generate novel molecules with better docking to proteins of interest, we binarize the docking scores $S$ of protein-ligand pairs as a hard property. Specifically, $S = 1$ if $S \leq -7.5$, and 0 otherwise. Thus, we obtain the molecule representation $h_m \in \mathbb{R}^{(n+2) \times d}$ by

$$h_m = [\,y W_p;\ S W_s\,],$$

where ";" is a stacking operation of matrices, $S$ here denotes the one-hot encoding matrix of $s_c^*$, and $W_p \in \mathbb{R}^{m \times d}$ and $W_s \in \mathbb{R}^{k \times d}$ account for two linear layers, respectively. Moreover, we add a property tag "p" at the head of $s_c^*$ to indicate the molecule with properties as $s_p = \{p, b, a_1, \ldots, a_n\}$. This is a crucial trick for generating novel molecules with desired properties. See also Section 2.5 for detailed reasons. To describe such a sequence briefly, we regard a tag or a word in it as a token, which is assigned a binary type indicator (i.e. 1 for property and 0 for word). Accordingly, the two token types are also embedded as vectors $t_1, t_0 \in \mathbb{R}^{1 \times d}$. The token type representation of the molecule is defined as the stacking of token type embeddings w.r.t. $s_p^*$, $h_{token} = [t_1; t_0; \ldots; t_0]$. Furthermore, we consider the positional relationship among tokens. Inspired by the Transformer (Vaswani et al. 2017), we use sine and cosine functions of different frequencies to encode the position of the $i$-th token into a $d$-dimensional unique representation $h_i^{pos} \in \mathbb{R}^{1 \times d}$ as follows:

$$h_i^{pos}(j) = \begin{cases} \sin\left(i / r^{j/d}\right), & \text{if } j \text{ is an even number}, \\ \cos\left(i / r^{(j-1)/d}\right), & \text{if } j \text{ is an odd number}. \end{cases}$$

The wavelengths form a geometric progression from $2\pi$ to $r \cdot 2\pi$, where $r = 10\,000$ as suggested by Vaswani et al. (2017). Thus, the positional representation of the molecule $H_{pos}$ is just the stack of $\{h_i^{pos}\}$.
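The sinusoidal token position encoding can be written out in plain Python as a small sketch. It follows the formulation of Vaswani et al. (2017), in which each even dimension and the following odd dimension share one frequency; the function name `sinusoidal_position` is an illustrative assumption.

```python
import math


def sinusoidal_position(i, d, r=10000.0):
    """Encode token position i as a d-dimensional vector of sines and cosines."""
    vec = []
    for j in range(d):
        pair = j - (j % 2)                 # even index shared by a (sin, cos) pair
        angle = i / (r ** (pair / d))
        vec.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return vec


# Stack the encodings of the first three token positions for d = 4.
H_pos = [sinusoidal_position(i, d=4) for i in range(3)]
```

Because each frequency differs, no two positions within the maximum sequence length receive the same vector.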
Finally, the whole embedding of small molecule $c$ is defined as the sum of the three representations above, $h_m + h_{token} + H_{pos}$.

Decoder and molecule generation
We directly adopt the same decoder architecture as that of the original Transformer, which contains $t$ tandem decoding units. Each unit is composed of a masked multi-head attention block, a protein-interactive multi-head attention block, and an ordinary multilayer perceptron. The masked module prevents the decoder from information leakage when predicting the next token. The interactive module calculates the proximity of molecule tokens to protein residues and atoms by regarding the former as queries and the latter as keys and values in an attention layer. See also Vaswani et al. (2017) for details. The molecule generation is completed by an autoregressive decoding process, which begins with the sequence of two tokens $\{p, b, *, \ldots, *\}$ and iteratively appends potential tokens $a_i^*$ to it one by one until the ending tag "e" (i.e. $\{p, b, a_1^*, \ldots, a_t^*, e\}$). The resulting sequence $\{a_1^*, \ldots, a_t^*\}$ is directly taken as the SMILES string of the novel molecule.
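To make the role of the protein-interactive attention block concrete, the following NumPy sketch implements a single-head scaled dot-product cross-attention in which molecule tokens act as queries and protein nodes (residues and atoms) act as keys and values. It is a simplified illustration: the actual block is multi-head, and all shapes, weight matrices, and the function name `cross_attention` are assumptions.

```python
import numpy as np


def cross_attention(mol_tokens, protein_nodes, W_q, W_k, W_v):
    """Return attended token features and the token-to-node attention weights."""
    Q = mol_tokens @ W_q                   # queries from molecule tokens
    K = protein_nodes @ W_k                # keys from protein residues/atoms
    V = protein_nodes @ W_v                # values from protein residues/atoms
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
mol = rng.normal(size=(5, 8))              # 5 molecule tokens, hidden size 8
prot = rng.normal(size=(7, 8))             # 7 protein nodes, hidden size 8
out, attn = cross_attention(mol, prot,
                            rng.normal(size=(8, 4)),
                            rng.normal(size=(8, 4)),
                            rng.normal(size=(8, 6)))
```

Each row of `attn` is a distribution over protein nodes, which is one way to read the "proximity" of a molecule token to residues and atoms.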
As remarked in Shuai et al. (2021) and Madani et al. (2021), the molecular property token should be put at the head of the token string because autoregressive decoding is essentially an iterative process under progressive conditional probabilities, in which the properties are the first condition for generating the next molecular token. This crucial step makes the molecular generation controllable w.r.t. properties.
The training of the model aims to minimize the following negative log-likelihood:

$$\mathcal{L} = -\sum_{i} \log P\left(x_i \mid x_0, x_1, \ldots, x_{i-1}, H_P\right),$$

where $x_0 = [p, b]$, and $x_i$ is the $i$-th token in $s_p$. In the generation, the model generates novel molecules based on the learned conditional probability distribution as

$$x_i^* \sim P\left(x \mid x_0, x_1^*, \ldots, x_{i-1}^*, H_P\right),$$

where $x_i^*$ is the generated token. Finally, complete generated token strings $\{s_p^*\}$ are obtained by the top-$k$ highest conditional probabilities, and their substrings $\{a_1^*, \ldots, a_t^*\}$ are the corresponding SMILES strings, such that protein-oriented novel molecules (with high binding affinities and desired properties) are generated.
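The training objective and the autoregressive generation loop can be sketched with a toy next-token model. This is a simplified illustration under stated assumptions: decoding is greedy rather than top-k, the protein representation is hidden inside the callable `next_token_probs`, and `toy_model`, `generate`, and `negative_log_likelihood` are hypothetical names.

```python
import math


def generate(next_token_probs, property_token="p", begin="b", end="e", max_len=200):
    """Greedy autoregressive decoding that starts from the two tokens [p, b]."""
    seq = [property_token, begin]
    while len(seq) < max_len + 2:
        probs = next_token_probs(seq)              # dict: token -> probability
        token = max(probs, key=probs.get)
        if token == end:
            break
        seq.append(token)
    return "".join(seq[2:])                        # drop the property and begin tags


def negative_log_likelihood(next_token_probs, tokens, property_token="p", begin="b"):
    """Sum of -log P(x_i | x_0, ..., x_{i-1}) over the target tokens."""
    seq = [property_token, begin]
    nll = 0.0
    for tok in tokens:
        nll -= math.log(next_token_probs(seq)[tok])
        seq.append(tok)
    return nll


def toy_model(seq):
    """Stand-in for the decoder: always prefers the next token of 'CCO' followed by 'e'."""
    target = ["C", "C", "O", "e"]
    return {target[len(seq) - 2]: 0.9, "N": 0.1}


smiles = generate(toy_model)
loss = negative_log_likelihood(toy_model, ["C", "C", "O", "e"])
```

Because the property tag is the first element of `seq`, every conditional probability in both functions is conditioned on it, which is the mechanism behind property control.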

Evaluation metrics
To evaluate the performance of molecule generation models, we follow the conventional settings in recent works (Bagal et al. 2021; Luo et al. 2021), which use Vina Score (VS), High Affinity Ratio (HAR), Quantitative Estimate of Drug-likeness (QED), Synthetic Accessibility Score (SA), Diversity, Water-Octanol Partition Coefficient (logP), Molecular Weight (MW) as the performance metrics. They are introduced as follows.
VS measures the average binding affinity between generated molecules and proteins of interest. We use Autodock Vina (Trott and Olson 2010) to calculate docking scores. Since docking scores are negative, the lower, the better.
SA reflects the average difficulty of synthesizing a given molecule from its synthesizable fragments (Ertl and Schuffenhauer 2009). A drug-like molecule usually has SA $\leq$ 4.0. The lower, the easier to synthesize. HAR indicates the percentage of generated molecules whose binding affinities are higher than or equal to those of the reference molecules. The greater, the better.
QED measures the average similarity between generated molecules and existing drugs by multiple chemical attributes (Bickerton et al. 2012). Its value falls into $[0, 1]$. The greater, the better. Diversity evaluates the diversity within a group of generated molecules $G$ in terms of chemical structure, and is defined as $\text{Diversity} = 1 - \frac{1}{N^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)$, where $N$ represents the number of generated molecules and $T(m_1, m_2)$ represents the Tanimoto similarity between molecule $m_1$ and molecule $m_2$. The greater, the better. logP, the water-octanol partition coefficient, is the ratio of a chemical's concentration in the octanol phase to its concentration in the aqueous phase of a two-phase octanol/water system. According to the Rule of 5 (RO5) proposed by Lipinski (Lipinski et al. 2001), logP should be <5.
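HAR and Diversity are simple enough to compute by hand, as the following sketch shows. Fingerprints are represented here as plain sets of "on" bit indices so that the example stays dependency-free; in practice they would come from a cheminformatics toolkit. The function names and toy values are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def diversity(fingerprints):
    """1 minus the mean pairwise Tanimoto similarity over all N^2 ordered pairs."""
    n = len(fingerprints)
    total = sum(tanimoto(x, y) for x in fingerprints for y in fingerprints)
    return 1.0 - total / (n * n)


def high_affinity_ratio(generated_scores, reference_score):
    """Fraction of generated molecules docking at least as well as the reference."""
    hits = sum(1 for s in generated_scores if s <= reference_score)
    return hits / len(generated_scores)


div = diversity([{1, 2}, {2, 3}])                               # two toy fingerprints
har = high_affinity_ratio([-8.0, -7.0, -9.1, -6.5], reference_score=-7.5)
```

Since lower docking scores indicate stronger binding, "at least as well as the reference" translates into a score less than or equal to the reference score.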
TPSA refers to the total surface area of all polar atoms. It measures the drug's ability to permeate cell membranes. Molecules with a TPSA > 140 Å$^2$ have a limited ability to permeate cell membranes.
A detailed discussion of property-controllable generation can be found in Section 3.3.

Dataset and parameter setting
We adopted the dataset widely used in previous works (Luo et al. 2021). Built by Luo et al. (2021) based on binding pose RMSD (i.e. RMSD < 1 Å), it contains over 100 000 protein-ligand docking pairs, involving 2922 protein pockets and 13 839 ligand molecules. Each pair has a docking score measured by Autodock Vina (Trott and Olson 2010). Following the procedure proposed by Luo et al. (2021), we first clustered proteins at a sequence identity level of 30% by MMseqs2 (Steinegger and Söding 2017), such that two proteins coming from different clusters have less than 30% sequence identity (i.e. are significantly different). Then, we took several clusters (i.e. 25 clusters) as the testing clusters and the remaining ones as the training clusters. After that, we randomly extracted 100 000 protein-ligand pairs from the training clusters to build the model, where 99 000 pairs are labeled as the training pairs and 1000 pairs as the validation pairs. Last, we randomly selected 100 proteins (involving about 18K protein-ligand pairs) from the testing clusters as the testing proteins (i.e. Reference) and assessed the performance of molecule generation w.r.t. proteins significantly different from the training proteins.
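The key point of this split is that whole clusters, not individual proteins, are held out. A minimal sketch is given below, assuming cluster labels are already available as a mapping from protein ID to cluster ID (for instance, from a clustering tool run beforehand); the function name `split_by_cluster` and the toy IDs are hypothetical.

```python
import random


def split_by_cluster(protein_to_cluster, n_test_clusters, seed=0):
    """Hold out whole clusters so that no test protein shares a cluster with training."""
    clusters = sorted(set(protein_to_cluster.values()))
    rng = random.Random(seed)
    test_clusters = set(rng.sample(clusters, n_test_clusters))
    train = [p for p, c in protein_to_cluster.items() if c not in test_clusters]
    test = [p for p, c in protein_to_cluster.items() if c in test_clusters]
    return train, test


# Toy mapping with five proteins in four clusters.
mapping = {"P1": 0, "P2": 0, "P3": 1, "P4": 2, "P5": 3}
train, test = split_by_cluster(mapping, n_test_clusters=2, seed=0)
```

Splitting at the cluster level is what keeps the testing proteins significantly different from the training proteins.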
We used the training set to tune the learnable model parameters while determining the hyperparameters by empirical suggestions in other works.
Specifically, when constructing amino acid graphs in the protein embedding module, each node was initially represented as a 20-dimensional one-hot feature vector accounting for amino acid types, and the number of its nearest neighbors was set as $k = 30$ as recommended by Ingraham et al. (2019). When encoding the edges between nodes, we used 64 Gaussian RBFs as suggested by Luo et al. (2021), where 64 centroids were taken at equal intervals between 0 and 25 Å and the width parameter of each RBF is the interval size (i.e. 25/64). Thus, each edge in amino acid graphs was represented as a 64-dimensional vector. Similarly, each node of the atom graph was initially represented as a 7-dimensional binary feature vector (Section 2.2) and the number of its nearest neighbors ($k$) was empirically assigned as 48. We used 64 Gaussian RBFs equidistantly spaced from 0 to 15 Å and set the width parameter to 15/64. As a result, each edge in atom graphs was also represented as a 64-dimensional vector. Finally, as Kreuzer et al. (2021) suggested, we collected the eigenvectors w.r.t. the 8 smallest eigenvalues of the Laplacian matrix as position codes.
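The RBF edge encoding can be illustrated as follows. The exact Gaussian form and the placement of the centroids (here including both endpoints) are assumptions, since the text only states the number of RBFs, the range, and the width; the function name `rbf_embed` is hypothetical.

```python
import math


def rbf_embed(distance, d_max=25.0, n_rbf=64):
    """Expand a scalar distance into n_rbf Gaussian radial basis features."""
    width = d_max / n_rbf                                   # interval size, e.g. 25/64
    centers = [i * d_max / (n_rbf - 1) for i in range(n_rbf)]
    return [math.exp(-((distance - c) / width) ** 2) for c in centers]


edge_residue = rbf_embed(7.3)                  # residue graph: 0 to 25 angstroms
edge_atom = rbf_embed(3.1, d_max=15.0)         # atom graph: 0 to 15 angstroms
```

Each distance activates mainly the basis functions whose centroids lie nearby, giving a smooth 64-dimensional edge feature.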
In the dual-view encoder, each of the encoders contains 6 tandem encoding units, of which each unit is composed of 4 heads of attention layers. The hidden dimensions of both nodes and edges were set as 256. The dimensions of Query and Key in both the encoder and the cross-fusion module were set as 32, while the dimension of Value was set as 64. In addition, the feedforward network contains 1024 neurons.
In the molecule embedding module, each token (including tags and properties) in SMILES strings was initially represented as a 112-dimensional binary vector, including the beginning tag (1-d), the ending tag (1-d), and the non-overlapping tokens w.r.t. SMILES strings (110-d) (Section 2.4). In the decoder, we set the length of token strings to the maximum length of SMILES sequences (i.e. 200). In addition, the parameters of the attention module in the decoder adopt the same values as those in the encoder. When training our model, we set the batch size as 4, the initial learning rate $\alpha$ as 1e-4, and selected Adam as the optimizer. To accelerate the optimization, we adopted a decay strategy to regulate the learning rate as follows. If the loss on the validation set does not decrease within 5 iterations, $\alpha^* = 0.6\alpha$ until it reaches 1e-5. We validated the model every 1000 training iterations and stopped the training if the loss did not decrease significantly within 20 validation iterations.
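The learning-rate decay and early-stopping rule can be sketched as a small stateful helper. This is one possible reading of the description: the patience of 5 is counted in validation checks, any decrease counts as an improvement (the notion of a "significant" decrease is not modeled), and the class name `PlateauDecay` is hypothetical.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` on plateaus and signal early stopping."""

    def __init__(self, lr=1e-4, factor=0.6, patience=5, min_lr=1e-5, stop_patience=20):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.min_lr, self.stop_patience = min_lr, stop_patience
        self.best = float("inf")
        self.since_best = 0       # validations since the best loss
        self.since_decay = 0      # validations since the last decay or improvement

    def step(self, val_loss):
        """Update the state after one validation; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.since_best = 0
            self.since_decay = 0
        else:
            self.since_best += 1
            self.since_decay += 1
            if self.since_decay >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.since_decay = 0
        return self.since_best >= self.stop_patience


sched = PlateauDecay()
sched.step(1.0)                                   # first validation sets the best loss
for _ in range(5):
    sched.step(1.0)                               # five validations without improvement
lr_after_plateau = sched.lr
```

In a training loop, `sched.step` would be called once per validation (every 1000 training iterations here), and the optimizer's learning rate would be set to `sched.lr` afterwards.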

Method comparison
We assessed the performance of our CProMG by comparison with five state-of-the-art (SOTA) protein-oriented generative approaches, including LiGANN (Skalic et al. 2019), 3D-SBDD (Luo et al. 2021), Pocket2Mol (Peng et al. 2022), the naïve Transformer-based method (Grechishnikova 2021), and AlphaDrug (BS) (Qian et al. 2022). These recently published approaches are briefly summarized as follows. LiGANN trained a GAN to generate 3D shapes of molecules that are topologically complementary to the corresponding protein pocket shapes, and then decoded the generated ligand shapes into multiple candidate SMILES strings by a captioning network. 3D-SBDD designed a 3D generative model to estimate the probability density of atom occurrences in the 3D binding space, and performed an auto-regressive sampling scheme on the spatial locations assigned higher probabilities to generate the 3D coordinates of molecules in a 3D grid, atom by atom. Pocket2Mol designed an E(3)-equivariant neural network to capture spatial and bonding relationships between atoms in the binding pocket and directly generated 3D coordinates of small molecules in continuous space. The naïve Transformer-based method directly applied the vanilla Transformer to generate novel molecule SMILES strings for specific amino acid sequences. Following this work, AlphaDrug improved the vanilla Transformer by adding skip connections from its encoders to its decoders. In addition, we employed DUD-E (dude.docking.org) to generate decoy molecules (denoted as Decoy) of the reference ligands that bind to the testing proteins.
Since those approaches adopt the same dataset, we ran each of them with the default parameter values given in its original paper to make a fair comparison. For each protein in the independent testing set, the top-10 molecules were generated for comparison.
It is the prime requirement that generated molecules bind to specific proteins with high affinities. Thus, we principally set the expected binding affinity as VS ≤ −7.5. Meanwhile, we expected two hard drug-like properties (i.e. QED ≥ 0.6 and SA ≤ 4.0) to ensure that generated molecules are of high drug-likeness and easy to synthesize, respectively. A detailed investigation on controlling more properties can be found in Section 3.3. The generation performance was, on average, measured by the first five metrics, including VS, SA, HAR, QED, and Diversity. For both VS and SA, the lower, the better. For the remaining metrics, the greater, the better. In addition, we list the average results recorded (denoted as "Reference") in the independent dataset as the baseline.
The results show that our CProMG significantly outperforms the Reference and other generative methods over all the metrics (Table 1, where P-values obtained by two-tailed t-tests are in parentheses). Since both HAR and Diversity are global metrics, the calculation of P-values is inappropriate for them. In particular, CProMG controlling VS, QED, and SA achieves the lowest VS, the highest QED, and the lowest SA, as expected. In contrast, since these SOTA approaches cannot control the generation of molecules in terms of drug-like properties, they achieve sharply worse VS, QED, and SA. Therefore, the comparison demonstrates the superiority of our CProMG.

Property-controllable generation
In this section, we investigate how well CProMG controls the molecule generation w.r.t. drug-like properties in a progressive manner including four scenarios. The first scenario, denoted as CProMG-w/oC, removes the controls of both binding affinity and properties. The second one, denoted as CProMG-V, keeps the control of binding affinity without property control by expecting VS ≤ −7.5. The third one, denoted as CProMG-VQS, adds the control of two hard properties, QED and SA, by expecting QED ≥ 0.6 and SA ≤ 4.0, on top of binding affinity control. The last one, denoted as CProMG-VQSLT, adds an extra control of two soft properties, LogP and TPSA, by expecting LogP = {2.0, 4.0} and TPSA = {40.0, 80.0}, on top of the third scenario. Thus, the last strategy contains four settings.
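To make the count of four settings in the last scenario concrete, the combinations of soft-property targets can be enumerated as below. This is a small sketch only: the dictionary layout and the names `BASE_TARGETS` and `vqslt_settings` are hypothetical and do not reflect how conditions are encoded in the released code.

```python
from itertools import product

# Targets shared with the third scenario (binding affinity plus two hard properties)
BASE_TARGETS = {"VS": -7.5, "QED": 0.6, "SA": 4.0}

# Two expected values for each soft property
LOGP_TARGETS = (2.0, 4.0)
TPSA_TARGETS = (40.0, 80.0)

# Every pairing of a LogP target with a TPSA target yields one setting
vqslt_settings = [
    dict(BASE_TARGETS, LogP=logp, TPSA=tpsa)
    for logp, tpsa in product(LOGP_TARGETS, TPSA_TARGETS)
]

for setting in vqslt_settings:
    print(setting)
```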
The overall results of the comparison are listed in Table 2 and its details are illustrated by the distributions of metric values in Fig. 2. The comparison reveals significant findings as follows.
1) Even without property control, CProMG can generate molecules whose properties are close to those of the reference molecules. For example, their QED/SA distribution (orange/green curves in Fig. 2) is similar to that of reference molecules (blue curves). 2) In contrast, CProMG with property control can generate molecules having better properties. Specifically, the controls of binding affinity, QED, and SA always contribute to high binding affinities, high QEDs, and low SAs as expected (Table 2, Fig. 2a and b). For example, ~99.18% of the novel molecules generated by CProMG-VQS show QED ≥ 0.6, while ~99.19% show SA ≤ 4.0. Moreover, the controls of LogP and TPSA make the generated molecules take LogP and TPSA values around their expectations (Fig. 2c and d). For example, all cases of CProMG-VQSLT show the peaks of value distributions at the expected property values with small dispersions.
3) There is a trade-off among the controls over diverse properties. Table 2 shows that a smaller LogP results in a smaller SA (better), a smaller QED (worse), and a bigger VS (worse), while a greater TPSA causes a smaller VS (better) and a bigger SA (worse). 4) As shown in Table 2, neither the binding-affinity control nor the drug-like property control increases the Diversity, which depends on other modules of CProMG.
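The percentages quoted in finding 2) are fractions of generated molecules that meet a property threshold. A minimal sketch of that calculation follows; the helper name `satisfaction_rate` is hypothetical, and the property values are made-up toy numbers for illustration only, not results from the paper.

```python
def satisfaction_rate(values, predicate):
    """Return the fraction of property values for which predicate(value) is True."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if predicate(v)) / len(values)

# Made-up property values (NOT taken from the paper)
toy_qed = [0.72, 0.65, 0.58, 0.81]
toy_sa = [2.9, 3.4, 4.2, 2.1]

qed_rate = satisfaction_rate(toy_qed, lambda q: q >= 0.6)  # hard condition on QED
sa_rate = satisfaction_rate(toy_sa, lambda s: s <= 4.0)    # hard condition on SA
print(f"QED >= 0.6: {qed_rate:.2%}, SA <= 4.0: {sa_rate:.2%}")
```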

Ablation studies
In this section, we investigated how well each component of our model contributes to the generation by ablation studies in the case of controlling the binding affinity. We made four variants of our original model by only considering the control of binding affinity since it is the prime requirement. Each variant masks one block of CProMG that helps generate molecules having high binding affinity with specific proteins. The first one only considers the atom view, ignoring the amino acid view (denoted as w/o AA), while the second ignores the atom view (denoted as w/o Atom). The third removes the Laplacian position encoding (denoted as w/o LPE). The last (denoted as w/o C) removes the conditional control of binding affinity (i.e. docking score). The comparison shows that CProMG significantly outperforms all the variants on VS and HAR (Table 3, where P-values are in parentheses). The results demonstrate that both the amino acid view and the atom view contribute to protein-oriented molecule generation because they provide coarse-grained and fine-grained representations of protein binding pockets, respectively. Also, the Laplacian positional encoding makes a nontrivial contribution to protein-oriented molecule generation because it can extract the unique positional representations of protein binding pockets. Last, the results reveal again that the conditional control of binding affinity is crucial to generating molecules with high binding affinity to specific proteins.
In general, the amino acid view encoder, the atom view encoder, the Laplacian position encoding and the property control play indispensable roles when generating protein-oriented molecules with desired binding affinity. Similarly, we also investigated how the conditional control of other properties affects the molecule generation. Similar results were found.

Case studies
As Peng et al. (2022) did, we selected the protein (PID: 5I0B) in the testing set as a case study. Its mutations are detected in multiple tumor tissues. After running CProMG-VQS, we selected its top-5 generated molecules in terms of VS, and applied RDKit to calculate their values of QED, SA, LogP, and TPSA (Fig. 3). We found that their SA = 2.827 and QED = 0.789 on average. In addition, each molecule satisfies the conditions of QED ≥ 0.6 and SA ≤ 4.0, while its LogP and TPSA fit the RO5. This demonstrates that the generated compounds are easy to synthesize and have good drug-like properties.
Looking into the binding pocket with AutoDock Vina (Trott and Olson 2010), we found that the reference inhibitor molecule has stable polar contacts with two surrounding residues (i.e. ASP-458 and LEU-398). Owing to the dual-view fusion encoder and the decoder, the five generated molecules retain polar contacts with at least one of these residues. Moreover, there are also polar contacts with other surrounding residues, such as GLU-323 in R1, GLY-330 in R2, and both SER-457 and GLU-396 in R5.
In addition, the structures of the generated molecules are significantly dissimilar to that of the reference molecule (i.e. 0.221, 0.222, 0.245, 0.228, and 0.263 in terms of Tanimoto similarity). The results validate that the molecules generated are novel.
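For readers unfamiliar with the metric, Tanimoto similarity between two molecular fingerprints is the number of shared on-bits divided by the number of on-bits in either fingerprint. The sketch below works on plain sets of bit indices and is only illustrative; the fingerprint type used in the paper is not stated in this section, and the function name `tanimoto` and the toy bit sets are assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| for two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

# Toy fingerprints (hypothetical bit indices, not derived from real molecules)
reference_bits = {1, 2, 3}
generated_bits = {2, 3, 4}
similarity = tanimoto(reference_bits, generated_bits)  # 2 shared / 4 total
```

A value near 0 means the two fingerprints share few substructure bits, so low values such as those reported above indicate that the generated molecules are structurally distinct from the reference.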
In summary, the case study demonstrates that novel molecules generated by our CProMG can not only bind to given specific proteins with high affinity but also possess desired drug-like properties.

Conclusion
In this article, we have proposed a protein-oriented generative framework for molecule generation (CProMG) under the control of high binding affinity and desired drug-like properties. CProMG contains a 3D protein embedding module, a dual-view protein encoder, a molecule embedding module, and a novel drug-like molecule decoder. This end-to-end framework can address two existing issues, namely inadequate protein representation and uncontrollable generation in terms of properties.

Figure 3. Case study. The reference molecule is located in the left column, while the top-5 generated molecules are listed in descending order w.r.t. VS in the remaining columns. Both their binding affinity (VS) and four drug-like properties (QED, SA, LogP, and TPSA) are annotated as well.
The comparison with recently published deep generative methods demonstrates the superiority of CProMG. Moreover, the progressive property control, the ablation studies, and the case study validate its contributions. First, CProMG provides a comprehensive framework to generate novel molecules for given proteins with high binding affinities and desired drug-like properties. Secondly, by fusing the hierarchical views of proteins, it significantly enhances the characterization of protein binding pockets by associating amino acid residues with their comprising atoms. Thirdly, the protein-interactive multi-head attention block in the decoder calculates the proximity of molecule tokens to protein residues and atoms, such that crucial interactions between protein pockets and molecules can be captured.
In summary, we believe that our study provides new insights into molecule generation for de novo drug design.